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ABSTRACT

Analyses of student responses to Introductory
Psychology test questions were discussed. The publisher supplied a
two thousand item test bank on computer tape. Instructors selected
questions for fifteen item tests. The test questions were labeled by
the publisher as factual or conceptual. The semester course used a
mastery learning format in which repeat testing was conducted using
alternate test forms. Standard item analysis included the percentage
of students passing and correlation coefficients. The second analysis
was based on a technique used in designing the New Jersey College
Basic Skills Placement Test. Each question was described as a
function of the probability of a correct response for students at
five ability levels as defined by total test scores. A difficulty
curve for each item resulted which allowed the instructor to see if
the question discriminated equally well among students at each
ability level or whether the item presents problems only to students
in a particular ability range. Test items with identical correlations
with total test scores often yielded different difficulty curves.
Difficulty curves with several different slope types were analyzed as
a function of the syntactic "frame" of the question. (Author/DWH)
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Item Analysis of Publisher-Supplied Test Questions
In Introductory Psychology

Anthony D. Lutkus

and

George Laskaris

Rutgers University - Newark

In 1932 Carl Brigham, father of the College Board's Scholastic Aptitude
Test, published a small red volume titled "A Study of Error." Reviewers
with almost fifty years' hindsight (Donlon, 1979; Findley, 1981) point out
to us that the hard-nosed, data-oriented Brigham was actually following in
the "process" tradition of Binet when he suggested that one can (and indeed
should) learn something about the workings of the mind by considering the
patterns of wrong answers to multiple-choice test questions. Brigham
actually developed diagrams much like a computer programmer's flow chart
to represent possible thought patterns in the responses to his multiple-choice
items. Today we would call him a "cognitive" psychologist. Brigham
referred to his work simply as "digging" and invited others to do the same.
It is in this spirit that we present the results of our "digging" into a
945 item test bank of multiple-choice test questions for Introductory
Psychology.

Virtually every commercially successful Introductory Psychology text
provides the instructor a test file of multiple-choice questions. For the
instructor with a large "intro" class, these test files are at once a
blessing and a bane. Every publisher lauds the qualifications of his test
maker but few provide data on expected difficulty in quantitative terms.
Thus, the instructor who selects his questions from such a file must make
"seat of the pants" guesses in editing out suspect items. Less fortunate
is the instructor employing a mastery learning or test-retest system. He
or she has no choice but to use almost all the supplied items in order to
generate alternate test forms. The writers were in the latter situation
but the use of a computer-managed mastery learning format has allowed us
some hindsight into relative item difficulty.

This research was sponsored, in part, by a grant from the Rutgers
University Council on Instructional Development.

Subjects

One hundred and seventy-five students in two one-semester introductory
psychology classes provided the data by answering varying proportions of the
945-question test bank. Students self-selected the psychology course
but did not know in advance that it would have a mastery learning format.
Fifty-seven percent were men and 43% were women. Students from this urban
commuter campus at Rutgers-Newark have SAT scores slightly above the
national average for four-year colleges.

Procedure

The course required that students attempt the tests (15 items each) on
at least 14 of 20 chapters in the text. Three different 15-item forms were
available in random order for each chapter. Students who did not meet the
criterion for "mastering" a chapter (13 of 15 correct) thus had two additional
attempts available on the same material. The number of students responding
to each question in the bank varied from 114 to as few as 33 on some alternate
forms of optional chapters.

Students tested on a self-paced schedule. They recorded their responses
to the four-choice test questions on a standard-type scan form. The forms
were optically scanned and the results processed by an on-line computer
program as the student waited. Within a minute the student could view the
results on a CRT computer terminal. The screen displayed the item numbers
of all wrong answers, the correct answer, and the page numbers in the textbook
where both the correct answer and the distractors could be found. Students
were allowed to keep their test forms while they viewed their results. All
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the error data were saved by the program for the present analysis.

The text used in the course was the second edition of Hall, Lindzey,
and Thompson's Psychology, published by Worth in 1978. The 2000 item test
bank was supplied by the publisher on a computer tape and questions for the
15-item tests were selected by the instructors and printed for student use.
The publisher described the following features of the test bank:

     1) Items were labeled as being either "factual" or
        "conceptual."

     2) The "conceptual" items emphasize an understanding
        of the textbook material as it applies to situations
        and examples not given in the text. Many of the
        conceptual questions present an example and require
        the student to pick a term that best fits the
        example. Others present a term and require the
        student to select an example that best illustrates
        the term. Still other conceptual items require
        the student to make predictions based on information
        presented in the text, to integrate ideas, to solve
        analogies, or to apply the learned information to real
        or hypothetical situations not specifically mentioned
        in the textbook.

     3) The factual questions require recognition of specific
        facts, definitions, or information presented in the
        text.
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Results

Two different modes of item analysis were used. The first or
"standard" item analysis can be reviewed quickly. Figure 1 shows a frequency
histogram of the "percent passing" for the 945 items. The mean percent
passing for all items was 67.62%, very close to the traditional "ideal"
average of a 70% "C" grade for a class. As can be seen in Figure 1, there
were very few "poor" questions; only [illegible].6% of all the items had passing
rates below 50%. The modal category for the test bank was 70-79% correct.
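
The "standard" analysis described above can be sketched in a few lines. The following is a minimal illustration, not the program used in the study: the function names, the item identifiers, and the response data are all hypothetical, and the ten-point bins are an assumption based on the "70-79%" category mentioned in the text.

```python
from collections import Counter


def percent_passing(responses):
    """Percentage of correct (1) responses among all responses to one item."""
    return 100.0 * sum(responses) / len(responses)


def bin_label(pct):
    """Label of the 10-point bin holding pct; 100% is folded into the top bin."""
    low = min(int(pct // 10) * 10, 90)
    high = 100 if low == 90 else low + 9
    return f"{low}-{high}%"


# Hypothetical data: item id -> one 1 (correct) or 0 (wrong) per responding student.
responses_by_item = {
    "q1": [1, 1, 1, 0, 1, 1, 1, 0, 1, 0],  # 7 of 10 correct
    "q2": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # 9 of 10 correct
    "q3": [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],  # 4 of 10 correct
    "q4": [1, 1, 1, 0, 1, 1, 0, 1],        # 6 of 8 correct
}

rates = {item: percent_passing(r) for item, r in responses_by_item.items()}
histogram = Counter(bin_label(p) for p in rates.values())
mean_rate = sum(rates.values()) / len(rates)
modal_bin = histogram.most_common(1)[0][0]

print(sorted(histogram.items()))
print(f"mean percent passing = {mean_rate:.2f}%, modal bin = {modal_bin}")
```

Because the number of respondents differs from item to item, each rate is computed against that item's own respondent count rather than the class size.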

Since all the item responses were saved in a computer file, we were
able to complete several analyses related to the publisher's stated
characteristics of the test item bank. Analysis of variance revealed,
interestingly, that there was no difference in overall difficulty between
the questions called "conceptual" and those labeled "factual." It would
probably be of little value for an instructor to spend time mixing, matching
or proportioning his tests on the basis of these labels, a strategy that
one might have been tempted to use if one expected "conceptuals" to be more
difficult or thought provoking than "factual" types.
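
For readers who wish to repeat the comparison on their own item bank, a one-way analysis of variance on item difficulty can be sketched as below. This is only an illustration under stated assumptions: the percent-passing values are invented, the function name is hypothetical, and the sketch returns the F ratio and its degrees of freedom without looking up a probability value.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical percent-passing values for items carrying each publisher label.
conceptual = [62.0, 70.0, 75.0, 66.0, 68.0]
factual = [64.0, 71.0, 73.0, 65.0, 69.0]

f_stat, df_b, df_w = one_way_anova_f([conceptual, factual])
print(f"F({df_b}, {df_w}) = {f_stat:.4f}")
```

An F ratio well below 1, as in this invented example, means the two labels differ less from each other than items differ within a label.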

There was also no difference in mean item difficulty by chapter. This
is an important point since it suggests that the question analysis to follow
here can be considered content-independent. The chapter mean percent passing
ranged from 74.28% to 64.01% for the nineteen chapters. Ranking the first
five chapters yields the following order with easiest first: (1) Sex,
(2) Memory, (3) Language, (4) Behavior Disorders and (5) Evolution and Heredity.
The chapter on human sexuality was, incidentally, not required reading.

The second analysis involves a technique recently used to edit items
for the New Jersey State College Basic Skills Placement Test. Test
questions are usually characterized by a single summary statistic, typically
"percent passing" and/or the point biserial correlation coefficient. These
single descriptors do not give the test maker information about the question's
discriminating power across the range of students. In other words, the fact
that a question was answered correctly sixty percent of the time could mean
that equally 60% of A, B, C and D level students passed it or that some
combination of the A and B grade students passed and all the C and D
students failed. One way to achieve more detailed information about what
a test question does is to plot item/test regression curves. Figure 2
illustrates such a curve.
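
The limitation of a single summary statistic can be made concrete with a small sketch. It uses the standard point biserial formula, r = ((M1 - M0) / s) * sqrt(pq), where M1 and M0 are the mean total scores of students who passed and failed the item, s is the standard deviation of all total scores, and p and q are the proportions passing and failing. The scores and item responses below are hypothetical.

```python
import math


def point_biserial(item, totals):
    """Point biserial correlation between a 0/1 item and total test scores.

    Assumes at least one student passed, at least one failed, and the
    total scores are not all identical.
    """
    n = len(item)
    passed = [t for x, t in zip(item, totals) if x == 1]
    failed = [t for x, t in zip(item, totals) if x == 0]
    p = len(passed) / n
    q = 1.0 - p
    mean_all = sum(totals) / n
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in totals) / n)
    m1 = sum(passed) / len(passed)
    m0 = sum(failed) / len(failed)
    return (m1 - m0) / sd * math.sqrt(p * q)


# Hypothetical total scores (out of 15) for ten students, best first.
totals = [15, 14, 13, 12, 11, 10, 9, 8, 7, 6]
item_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 60% passing, all from the top scorers
item_b = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # 60% passing, spread across the range

r_a = point_biserial(item_a, totals)
r_b = point_biserial(item_b, totals)
print(f"item_a: r = {r_a:.3f}   item_b: r = {r_b:.3f}")
```

Both hypothetical items have the same percent passing, yet they relate to total score very differently, which is why percent passing alone says little about who is missing the question.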

The ordinate shows the percentage of students who passed the item.
The abscissa is divided into five categories based on the student's total
test score on the form in which the test item appeared. Each curve
represents the results for one question. If the "A" students (as defined
by the internal criterion of their total test score) have the highest
probability of passing the item and the "B" students the next highest
probability and so on, down to the poorest students, we would expect the
linear ascending function shown in Figure 2. In fact Figure 2 is what the
"average" question does look like for this test bank. All the item data
were summated to generate Figure 2. The bars around each point indicate
the standard deviations for the 945 items. Approximately one-third of
the items plot within one standard deviation of this graph.
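
One way to compute the points of an item/test regression curve is sketched below. The cut points that divide total scores into five categories are not given in this paper, so the bands used here are purely an assumption for a 15-item form, as are the function names and the sample data.

```python
# Hypothetical score bands for a 15-item form (label, lowest score, highest score).
# These cut points are an assumption; the paper does not state the ones it used.
BANDS = [("F", 0, 7), ("D", 8, 9), ("C", 10, 11), ("B", 12, 13), ("A", 14, 15)]


def band_of(total):
    """Return the label of the band that contains a total test score."""
    for label, low, high in BANDS:
        if low <= total <= high:
            return label
    raise ValueError(f"total score out of range: {total}")


def item_test_curve(item, totals):
    """Percent passing the item within each band (None when a band is empty)."""
    counts = {label: [0, 0] for label, _, _ in BANDS}  # [passed, attempted]
    for x, t in zip(item, totals):
        cell = counts[band_of(t)]
        cell[0] += x
        cell[1] += 1
    return {label: (100.0 * p / n if n else None) for label, (p, n) in counts.items()}


# Hypothetical data: ten students' total scores and their 0/1 result on one item.
totals = [15, 14, 13, 12, 11, 10, 9, 8, 7, 6]
item = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

curve = item_test_curve(item, totals)
print(curve)
```

Plotting the five values from the lowest band to the highest gives one curve of the kind shown in Figure 2.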

An easy question, or one with a "ceiling" effect, is plotted in Figure 3.
A, B, C and sometimes even D graded students are highly successful on this
type of question. It does not discriminate well across the range of students.

An unfairly difficult question would have the slope type shown in
Figure 4. We termed this a "flat-low" slope. This kind of question also
is a poor discriminator in that everyone misses it, regardless of ability.

The fourth slope type is shown in Figure 5. Here the "A" and "B"
students have reasonable probabilities of success, at least comparable to
the linear slope from Figure 2, but the "C" and "D" students are at roughly
the chance level. It is interesting to note that one could not predict
whether a question was a "high-end discriminator" or a "linear" type by
knowing the single statistic of average percent passing.

We discovered each of these four curve types in the question bank by
having the computer actually plot about a third of the items. Other types
of slopes are possible, for example a "medium-level discriminator," given
a different type of test. A placement test or minimum skills test where the
questions can be hierarchically ordered (as in algebra) would tend to have
medium-level discriminators (Dass and Pine, 1981).

Once we had sampled the types of slopes in the item bank we wrote an
algorithm to describe each curve mathematically. The computer was then
used as a pattern recognition device to loop through the data for all the
items and identify those questions which resembled each type within the
bounds of .75 of a standard deviation at each category. Approximately 40%
of the questions "fit" into one of these four slope types. This capture
ratio could be improved if wider confidence limits are chosen but then the
slope types tend to lose definition and overlap.
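
The matching step can be illustrated with a small sketch. It is not the algorithm used in the study, whose details are not reproduced here. It assumes each slope type is summarized by a template of five (mean, standard deviation) pairs, one per score category, and accepts an item when every one of its five points lies within .75 of a standard deviation of the template. All template numbers below are invented for illustration.

```python
TOLERANCE_SD = 0.75  # the bound stated in the text

# Hypothetical templates: (mean percent passing, standard deviation) for the
# five score categories, lowest to highest. The values are invented.
TEMPLATES = {
    "linear": [(45, 12), (58, 12), (68, 12), (78, 12), (90, 12)],
    "ceiling": [(80, 10), (88, 10), (92, 8), (95, 8), (98, 8)],
    "flat-low": [(25, 10), (27, 10), (28, 10), (30, 10), (32, 10)],
    "high-end discriminator": [(25, 10), (27, 10), (30, 10), (75, 12), (90, 12)],
}


def matching_types(curve, tolerance=TOLERANCE_SD):
    """Return the slope types whose template the five-point curve fits."""
    matches = []
    for name, template in TEMPLATES.items():
        fits = all(abs(y - mean) <= tolerance * sd
                   for y, (mean, sd) in zip(curve, template))
        if fits:
            matches.append(name)
    return matches


print(matching_types([50, 55, 70, 80, 88]))  # fits the linear template
print(matching_types([26, 25, 33, 80, 95]))  # fits the high-end template
print(matching_types([60, 40, 90, 50, 70]))  # fits no template
```

Raising the tolerance argument lets more curves match, but a curve may then satisfy more than one template, which mirrors the loss of definition noted above.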

Table 1 presents a summary of the descriptive data for each of the
slope types. Notice that the point biserial correlation coefficients range
widely within each slope type and consequently would not give the tester a
clue as to what kind of slope a given question might generate. The point
biserial correlation coefficient is frequently used as a measure of the
appropriateness of a test item relative to the total test. The Educational
Testing Service (Hecht and Swineford, 1981) operates under the convention
that moderate levels of mean biserial correlations (between .40 and .55)
are "good" and low levels (less than .25 or .30) are suspect. Our data
indicate that either level may yield a useful item, if you know its slope
type.

In Table 1 it can be seen that knowing the percent passing rate can be
a useful indicator of slope type but only if the rate is very high (ceiling
type slope) or very low. Consequently we asked the question whether slope
types on the item/test regression curves can be predicted by any other
variables. Since the publisher's question writers deliberately used
several question "frames" across all the items in this test bank we
decided to code the questions according to their "form" (not content) and
look for proportional patterns within slope types.

The question forms or frames we coded were:

Example into Term - The student is given a hypothetical example
     and required to choose the term that best fits the example.

     An investigator measures the speed at which a frog catches a fly.
     The researcher is demonstrating

          1. Repeatability.
          2. Quantification.
          3. Subjectivity.
          4. Communication.

Two Blanks - A question with two fill-ins; the pair to be selected
     as a unit from four choices.

     Kinsey reported differences among women's sexual behaviors. In
     general, college women engaged in more ______ and less ______ than
     less-educated women.

          1. Premarital intercourse; homosexual behavior
          2. Homosexual behavior; masturbation
          3. Petting; masturbation
          4. Masturbation; premarital intercourse

Definitions - The student has to recognize a definition either in the
     question body or from among the multiple choices.

     Psychoanalysis is

          1. A method of studying behavior.
          2. A theory of behavior.
          3. Both a method of studying behavior and a theory of behavior.
          4. A survey method.

Recognizing Facts or Findings - The student must find or recognize a
     fact or result of a study.

     The portion of the central nervous system that has been shown to play
     a role in impotence and certain fetishes is the

          1. Septum.
          2. Frontal Lobe.
          3. Hypothalamus.
          4. Temporal Lobe.

Integration of Ideas - A question involving several conceptual steps
     such as combining knowledge of several definitions or findings
     and relating them to material not in the text.

     A normal adult requests a "split-brain" operation in order to increase
     her ability to perform two tasks simultaneously. A major argument
     against granting her request is that after such an operation

          1. Her walking, running, and balance would be impaired.
          2. She would be less intelligent.
          3. Her speech would be disrupted.
          4. The processes in each of the two halves of her brain would
             not be coordinated.

Prediction - The student is required to foresee results of a stated
     manipulation or to choose alternatives.

     Jeff has a genetic predisposition toward TB. The likelihood that he
     will contract the disease is

          1. Low, but only if he is careful.
          2. High, because of his susceptibility.
          3. High, if he is in contact with his relatives.
          4. Low, because of improved living conditions.

Not True Of - This category includes questions where the student is
     called upon to do an exclusion process. Questions here include
     sentences such as "all but which of" or "which is not true of..."

     Kinsey's conclusions regarding differences between men and women
     included all but which of the following?

          1. Men react more to erotic stories.
          2. Men talk more about sex.
          3. Men prefer more diverse forms of bodily stimulation.
          4. Men engage in more sexual fantasies.

The relationship between these question frames and three of the item slope
types is shown in Figures 6, 7, and 8. The percentage of each question frame type
found within each slope category is plotted on the vertical axis. Note
that both the linear type slope and the ceiling type contain large
proportions of "recognizing facts" and "example-into-terms." The pattern
for a "high-end discriminator" slope is different. The relative proportion
of "example into term" frame types remains large but the proportion of
factual recognitions decreases. This difference in patterns fits with
the general notion that if a question discriminates only among the better
students it probably involves some additional thought processes or mental
steps beyond recognition of facts. This seems to be supported mainly by
the decrease in proportion of factual recognitions but, unfortunately,
not by any substantial increase in question frames that might require
more complex (2 blank, integration, or prediction) thinking.
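
The tabulation behind Figures 6, 7, and 8 amounts to counting frame types within each slope category and converting the counts to percentages. A minimal sketch follows, assuming each item has already been assigned a slope type and coded for its frame. The coded list and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def frame_percentages(coded_items):
    """Map slope type -> {frame: percent of that slope type's items}."""
    by_slope = defaultdict(Counter)
    for slope, frame in coded_items:
        by_slope[slope][frame] += 1
    result = {}
    for slope, counts in by_slope.items():
        total = sum(counts.values())
        result[slope] = {frame: 100.0 * n / total for frame, n in counts.items()}
    return result


# Hypothetical (slope type, question frame) codes for eight items.
coded = [
    ("linear", "recognizing facts"),
    ("linear", "recognizing facts"),
    ("linear", "example into term"),
    ("linear", "definitions"),
    ("high-end discriminator", "example into term"),
    ("high-end discriminator", "example into term"),
    ("high-end discriminator", "recognizing facts"),
    ("high-end discriminator", "prediction"),
]

table = frame_percentages(coded)
for slope, shares in table.items():
    print(slope, shares)
```

Percentages are taken within each slope type, so the values for one slope type sum to 100 regardless of how many items that type contains.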
Conclusion

Based on our analyses we would recommend that test item publishers
refrain from labeling their questions by unvalidated categories like factual
vs. conceptual. Second, we would like to see publishers furnish data on
expected "percent passing" for each question. Better still would be information
on expected slope type generated by each question. Since many publishers
claim to pre-test their questions, this recommendation may not be as costly
as it at first seems. Armed with a knowledge of the expected slope types
an instructor could construct tests that would contain a known mix of
average (linear) questions, as well as high-end discriminators that would
challenge the better students. Questions with ceiling type slopes could
be judiciously sprinkled in as the "gifts" they are.

For the instructor faced with selecting questions from a test bank
without such supplementary statistics we cannot yet offer a definitive
method for moving from question frame to predicting slope type since all
frame types can be found within each kind of slope. The important variable
is the proportion of frame types that make up the test. We would suggest
that questions involving the simple recognition of facts or findings should
make up no more than 35 to [illegible]% of a college level test. Questions involving
going from examples into terms, or manipulating information in a way not
found in the text are more likely to yield "high-end discriminators" that
will challenge the better student.
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TABLE 1

SUMMARY OF QUESTION SLOPE TYPES*
AND THEIR CHARACTERISTICS

SLOPE TYPE                %          % OF         POINT BISERIAL       % CONCEPTUAL/
                          PASSING    QUESTIONS    COEFFICIENT RANGE    % FACTUAL

Linear                    69.5%      19.0%        .1 to .7             [illegible]/55
Ceiling                   89.7%       8.5%        [illegible] to .7    55/45
High-End Discriminator    62.3%      10.5%        .1 to .9             38/62
Flat Low                  28.5%       2.0%        -.6 to .5

* On item/test regression curves.
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Figure 1. Frequency of test questions grouped by percent passing.
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Figure 4. Sample item/test regression line for questions having flat-low slopes.
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